Efficient Block Cyclic Data Redistribution
Authors
Abstract
Implementing linear algebra kernels on distributed-memory parallel computers raises the problem of distributing matrices and vectors among the processors. Block-cyclic distribution suits most algorithms well, but the block size must be chosen as a compromise between computation efficiency, communication efficiency, and load balancing. The best choice depends heavily on each operation, so it is essential to be able to go from one distribution to another very quickly. We present here the algorithms we implemented in the SCALAPACK library. A complexity study proves the efficiency of our solution, and timing results on the Intel Paragon and the Cray T3D corroborate it. We show the gain that can be obtained by combining a good data distribution with our redistribution routines on 3 numerical kernels.
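To make the idea of moving between two block sizes concrete, here is a minimal sketch of a one-dimensional block-cyclic layout and of a cyclic(r) to cyclic(s) redistribution. It is an in-process simulation only: the function names (`owner`, `local_index`, `distribute`, `redistribution_plan`, `redistribute`) are hypothetical, no message passing is performed, and this is not the algorithm or the interface of the SCALAPACK routines described in the abstract.

```python
from collections import Counter


def owner(i, block, nprocs):
    """Process owning global index i under a 1-D block-cyclic(block) layout."""
    return (i // block) % nprocs


def local_index(i, block, nprocs):
    """Position of global index i inside its owner's local array."""
    return (i // (block * nprocs)) * block + i % block


def distribute(vec, block, nprocs):
    """Split a global vector into per-process local lists."""
    local = [[] for _ in range(nprocs)]
    for i, value in enumerate(vec):
        local[owner(i, block, nprocs)].append(value)
    return local


def redistribution_plan(n, r, s, nprocs):
    """Count the elements each (source, destination) pair exchanges
    when an n-element vector goes from cyclic(r) to cyclic(s)."""
    plan = Counter()
    for i in range(n):
        plan[(owner(i, r, nprocs), owner(i, s, nprocs))] += 1
    return plan


def redistribute(local_src, n, r, s, nprocs):
    """Simulate the cyclic(r) -> cyclic(s) move on in-memory lists."""
    sizes = Counter(owner(i, s, nprocs) for i in range(n))
    local_dst = [[None] * sizes[p] for p in range(nprocs)]
    for i in range(n):
        value = local_src[owner(i, r, nprocs)][local_index(i, r, nprocs)]
        local_dst[owner(i, s, nprocs)][local_index(i, s, nprocs)] = value
    return local_dst


if __name__ == "__main__":
    vec = list(range(12))
    src = distribute(vec, 3, 2)          # cyclic(3) on 2 processes
    dst = redistribute(src, 12, 3, 2, 2)  # move to cyclic(2)
    print(src)
    print(dst)
    print(dict(redistribution_plan(12, 3, 2, 2)))
```

The plan returned by `redistribution_plan` is the kind of quantity a communication schedule would be built from: it says how many elements each pair of processes must exchange. The sketch loops over every element for clarity, which a real implementation would presumably avoid.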
Similar resources
Efficient Methods for kr→r and r→kr Array
Array redistribution is usually required to enhance algorithm performance in many parallel programs on distributed memory multicomputers. Since it is performed at run-time, there is a performance tradeoff between the efficiency of new data decomposition for a subsequent phase of an algorithm and the cost of redistributing data among processors. In this paper, we present efficient algorithms for...
Multi-phase array redistribution: modeling and evaluation
[Table 1 (column labels: s, t, lcm, lcm*2, lcm*4, gcd, gcd/2, gcd/4): Execution times (ms) for cyclic(s) to cyclic(t) redistribution on 32 processors.] ... other block sizes t. Fig. 3 shows the total times in milliseconds for a cyclic(192) to cyclic(8) redistribution on 32 processors for increasing data sizes. This redistribution corresponds to the cyclic(Y t) to cyclic(t) case with Y = 2...
Packing/Unpacking Information Generation for Efficient Generalized kr→r and r→kr Array Redistribution
Array redistribution is usually required to enhance algorithm performance in many parallel programs on distributed memory multicomputers. Since it is performed at run-time, there is a performance tradeoff between the efficiency of new data decomposition for a subsequent phase of an algorithm and the cost of redistributing data among processors. In this paper, we present efficient methods to gen...
Irregular Redistribution Scheduling by Partitioning Messages
Dynamic data redistribution enhances data locality and improves algorithm performance for numerous scientific problems on distributed memory multi-computer systems. Regular data distribution typically employs BLOCK, CYCLIC, or BLOCK-CYCLIC(c) to specify array decomposition. Conversely, an irregular distribution specifies an uneven array distribution based on user-defined functions. Performing ...
A Basic-Cycle Calculation Technique for Efficient Dynamic Data Redistribution
Array redistribution is usually required to enhance algorithm performance in many parallel programs on distributed memory multicomputers. Since it is performed at run-time, there is a performance trade-off between the efficiency of the new data decomposition for a subsequent phase of an algorithm and the cost of redistributing data among processors. In this paper, we present a basic-cycle calcu...
Message Scheduling for Irregular Data Redistribution in Parallelizing Compilers
In parallelizing compilers on distributed memory systems, distributions of irregularly sized array blocks are provided for load balancing and irregular problems. Irregular data redistribution differs from regular block-cyclic redistribution. This paper is devoted to scheduling messages for irregular data redistribution in a way that attempts to obtain suboptimal solutions while satisfying the m...
Publication date: 1996